Fast, Energy Efficient CMOS 2P1R1W Register File Array Using Harvested Data

ABSTRACT

A transistor memory device includes storage elements storing a capacitance including (1) a capacitance at a source of PFETs, (2) a capacitance at each storage element connected to a storage node and (3) a capacitance at a gate input of inverter transistors from the plurality of transistor storage elements. Each storage element is configured to perform (i) a read data access or (ii) a write data access, to increase static noise margin. The transistor memory device further includes a harvest node coupled to a ground and that is configured to store a harvested charge transferred from a selected bitline to increase an output voltage at the harvest node. The transistor memory device further includes a capacitor divider configured to maintain a voltage swing on a bitline. The transistor memory device further includes a harvest circuit configured to, in response to the read data access, decouple the harvest node and invert a voltage.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to and is a continuation of U.S. patent application Ser. No. 17/578,482, filed Jan. 19, 2022, entitled “Fast, Energy Efficient Cmos 2p1r1w Register File Array Using Harvested Data”, which claims priority to U.S. Provisional Application No. 63/247,136, filed Sep. 22, 2021, entitled “Fast, Energy Efficient Cmos 2p1r1w Register File Array Using Harvested Data”, and U.S. Provisional Application No. 63/138,456, filed Jan. 17, 2021, entitled “Fast, Energy Efficient Cmos 2p1r1w Register File Array Using Harvested Data”, each of which is hereby incorporated by reference in its entirety. The application claims priority to U.S. Provisional Application No. 63/247,136, filed Sep. 22, 2021, entitled “Fast, Energy Efficient Cmos 2p1r1w Register File Array Using Harvested Data”.

FIELD

The present disclosure generally relates to digital integrated circuits. In particular, the present disclosure is related to a fast, energy efficient CMOS 2P1R1W Register File Array using Harvested Data.

BACKGROUND

While power density of CMOS chips was held constant with constant electric field (Dennard) scaling for over 30 years, increases in CMOS device variability at lower operating voltages and scaled geometries, in tandem with reductions in circuit speed from non-scaling of gate overdrive due to exponential increases in leakage from scaling MOSFET threshold voltages, limited CMOS voltages from scaling to much below 1 V. These trends brought an end to Dennard scaling (FIG. 1A) in 2004. At constant voltage scaling, power density increases as the cube of the scaling factor, limiting processor clock frequencies to below 5 GHz during the last 15 years.

SUMMARY

In some embodiments, a transistor memory device includes a plurality of transistor storage elements storing a collective capacitance including (1) a capacitance at a source terminal of each p-channel field-effect transistor (PFET) from a plurality of PFETs, (2) a capacitance at each transistor storage element from the plurality of transistor storage elements electrically connected to a storage node and (3) a capacitance at a gate input of a plurality of inverter transistors from the plurality of transistor storage elements. Each transistor storage element from the plurality of transistor storage elements includes a word line port configured to select (a) a bitcell and (b) a first bitline or a second bitline. Each transistor storage element from the plurality of transistor storage elements is configured to perform (i) a read data access from or (ii) a write data access to each remaining transistor storage element from the plurality of transistor storage elements, to increase a static noise margin in response to a decrease of a read current and a voltage on the storage node. The collective capacitance of the plurality of transistor storage elements is greater than a terminal capacitance of the selected bitline. The transistor memory device further includes a harvest node electrically coupled to a ground and that is configured to store a harvested charge transferred from the selected bitline to increase an output voltage at the harvest node. The transistor memory device further includes a capacitor divider electrically connected between the selected bitline and the harvest node of a first transistor storage element from the plurality of transistor storage elements that shares the selected bitline and the harvest node. The capacitor divider is configured to maintain a voltage swing on the selected bitline. The transistor memory device further includes a harvest circuit electrically coupled to the harvest node and configured to, in response to the read data access performed by the first transistor storage element, decouple the harvest node from the ground and invert a voltage equal to a potential difference between the selected bitline and the harvest node.

BRIEF DESCRIPTION OF DRAWINGS

FIGS. 1A-1B are illustrations of graphs depicting an end of Dennard scaling, where CMOS performance is limited by a cubic increase in power density with non-scaling of operating voltage, and where heat removal with more sophisticated and expensive packaging is possible, but not for much longer (shown as red diamonds), according to some embodiments.

FIG. 2 is an illustrative representation of energy consumption limited by the energy cost of moving data, according to some embodiments.

FIG. 3 is an illustrative representation of dataflows depicting improving energy efficiency with maximum data reuse and local RF access, according to some embodiments.

FIGS. 4A-4B are a schematic illustration of a conventional 2P 1R1W Register File bit path and an illustrative representation of a graph depicting waveforms during Read Access in a conventional 2P 1R1W Register File bit path, according to some embodiments.

FIG. 5 is a schematic illustration of a layout of a conventional 2P RF bitcell, according to some embodiments.

FIGS. 6A-6B are an illustrative representation of a PBTI stress condition on N2, equivalent to that seen in transistor NR1 of the RF bitcell, and of the VT shift due to PBTI in SRAM bitcells over a period of 100 M secs (3 years) (worse for wider devices): 10 mV-15 mV with sigma VT adder: 2 mV-4 mV, according to some embodiments.

FIGS. 7A-7B are a schematic illustration of an array architecture and Assist circuits of an 8 KB RF Array in 16 FF CMOS using a non-hierarchical 8:1 column multiplexing for writes and an illustrative representation of a graph depicting wiring parasitic parameters of 0.185 fF/um and 0.95 ohms/sq for Mx lines, according to some embodiments.

FIG. 8 is an illustrative representation of dimensions and wiring parasitics of the RF array that are used in the design of peripheral circuits to compare metrics of performance and energy efficiency of components using either proposed or conventional circuits. The array shown assumes Global I/O and Control in the middle and not at the bottom as seen in FIG. 7A. This is because the bit path wire resistance could be less limiting in response time, according to some embodiments. Local Decode, Datapath Control Logic: includes Local WL decode logic, Reset, LBL pre-charge, data path harvest control.

Global I/O, Control—Address<0:10>, Data in, out<0:31>, CLK, R/W.

8:1 Column mux for Write assumed. Global I/O & Control placed in the middle of the instance to limit R of pitch constrained Global BLs.

RWL, WWL: 128 b (100 um): Cw=128×0.767 um×1.02×0.185 fF/um=18.52 fF. (R=0.95 ohms/sq with double metal lines; R_RWL=475 ohms.)

LBL: 16 b (3.45 um): Cw=16×0.18 um×1.2×0.185 fF/um=0.64 fF. R_LBL=33 ohms.

GRBL, GWBL: 224 b (50.4 um): Cw=224×0.18 um×1.25×0.185 fF/um=9.324 fF. R_GRBL=475 ohms.

FIGS. 9A-9B are a schematic illustration of a circuit schematic of a proposed 2P 1R1W Register File bit path and of the LBL response to a WL select edge and the accompanying harvest of signal charge from LBL to V2L, according to some embodiments. This sensing scheme eliminates the need for a Sense Amp Enable signal (when differential sensing is used) and its accompanying overheads in performance, power and area, emulates a bitcell with twice the read current and consumes much less power with self-disabling action when sensed data is captured. The proposed scheme is energy efficient relative to large signal sensing as well, since large signal schemes dissipate all of the charge on the LBL. Conventional circuits continue discharging the LBL even after sensed data has been latched. Moreover, harvested charge from the GRBL can lower Write energy by over 30% using harvested charge on V2, with more available to further reduce energy consumed by WL drivers, decoders and control circuits.

FIG. 10 is a schematic illustration of an RF bitcell with the GND contact of the Read Stack replaced with V2L in the proposed scheme, a wire that runs parallel to and is similar to the LBL in length and capacitance, according to some embodiments.

FIG. 11 is an illustrative representation of a higher harvest voltage on the global harvesting node in each column, V2G, which self-limits signal development on GRBL with substantial reduction in Global bitline energy consumption, according to some embodiments. V2G: 20.2 um so that C_(V2G)/C_(GRBL)≈0.4. This capacitance divider ratio drives limited charge from the Read signal developed on the GRBL to be driven to a higher voltage, enabling faster sensing action: V_(V2G)=ΔQ/C_(V2G)=ΔV_(GRBL)·(C_(GRBL)/C_(V2G)).

FIGS. 12A-12B are an illustrative representation of a generation and synchronization of bitpath control signals and a graph depicting waveforms from circuit simulations, according to some embodiments.

FIG. 13 is an illustrative representation of a graph depicting that, with noise of 0.3 V applied to the Gate input of NR1 in the read stack, for longer WL pulse widths, the LBL is discharged through the read stack, flipping the output of the Global Read BL incorrectly due to the disturb noise in the half-selected RF cell in conventional bitpaths without the keeper circuits, according to some embodiments.

FIG. 14 is an illustrative representation of a graph depicting that, with noise of 0.3 V applied to the Gate input of NR1 in the read stack of the RF bitcell, V2L asymptotically increases to equalize the noise voltage while disabling the read stack by lowering the Gate overdrive of NR1 to below VT into the subthreshold region. The read stack in the bitcell thus cannot evaluate the LBL to an incorrect value, as it would when conventional RF array peripheral circuits are used (FIG. 12A or FIG. 12B), according to some embodiments.

FIG. 15A is a schematic illustration of a decode path of Block Select and RWL. Not shown (for simplicity) is RE·CLK′ that gates pre-decoder outputs to each Block.

FIG. 15B is an illustrative representation of a decode path for WWL, and a decode stage used instead of a conventional static CMOS NAND2 corresponding to gates highlighted in blue in FIG. 15A, according to some embodiments.

FIG. 16A is a schematic illustration of GWBL (D_in) drivers using charge harvested on the V2 grid to lower their energy consumption by over 30%, according to some embodiments.

FIG. 16B is an illustrative representation of a schematic of a GWBL (D_in) driver that uses charge harvested on V2G (in FIGS. 9, 11) to lower its energy consumption by over 30%, according to some embodiments. Current drawn from VDD by this driver using harvested charge is shown by the red waveform and compared to the current drawn from VDD by a conventional driver (shown by the blue waveform).

FIG. 17 is an illustrative representation of charge harvested from GRBL on to V2G in each bit column during a Read access being moved to a V2 grid, as shown in FIG. 11, before the next Read access, according to some embodiments. V2 lines are connected, enabling local decode, control, global I/O and control circuits to use this aggregate of harvested charge as well. Harvested charge on the V2 grid is immediately available for GWBL line drivers to use during a Write access. For a typical MAC operation, a Write access for every 3 Read accesses leaves substantial charge on the V2 grid reservoir for Decode, control, I/O circuits and for components external to the array to use.

FIGS. 18A-18B are an illustrative representation of a graph depicting that the voltage of charge on harvest grid V2 asymptotically approaches (with only Read operations) the voltage of node V2G in the schematic shown in FIG. 9, and a graph depicting that, with the relative activity of a Write column, the V2 grid approaches 0.5 V, which enables the harvester in FIG. 15 to lower GWBL driver energy by over 30%, according to some embodiments.

FIGS. 19A-19B are an illustrative representation of a graph depicting voltage waveforms and WL->G_out delay components along a Read Bitpath in a conventional RF Array and a graph depicting voltage and current waveforms of the Bitpath in a conventional RF array, according to some embodiments.

FIG. 20A is an illustrative representation of a graph depicting voltage waveforms and WL->G_out delay components along a Read Bitpath in an RF Array with proposed circuits, according to some embodiments.

FIG. 20B is an illustrative representation of a graph depicting a voltage waveform and WL->G_out delay components along a Read Bitpath in an RF Array with proposed circuits and with LVT devices NR1 and NR2 in the bit cell. Leakage from the array is unchanged, but the WL Select->Global Data_out delay improves on the equivalent delay in a conventional RF Array by over 50% (see FIG. 19A and Table I for more comparisons), according to some embodiments.

FIG. 20C is an illustrative representation of a graph depicting a 2×2 layout of the 16 FF Foundry 1R1W bitcell showing the opportunity to lower the VT of the decoupled Read stack using an LVT mask (region highlighted in dashed box) for even higher performance without being constrained by leakage of the Read stack when using proposed harvesting schemes, according to some embodiments.

FIG. 20D is an illustrative representation of a graph depicting a Read and Write Bitpath response in the Proposed Harvesting scheme, according to some embodiments. Quantitative comparison of WL->Data_out delay and bitpath energy consumption.

FIG. 21 is an illustrative representation of a comparison of Read and Write Energy Consumption of the Proposed charge harvesting scheme with conventional RF array designs, according to some embodiments.

FIG. 22 is a schematic illustration of leakage paths along the decoupled Read stack in conventional arrays (top) and in the arrays with proposed circuits (bottom), according to some embodiments. In the proposed scheme, unlike conventional arrays, leakage is independent of the number of bitcells per LBL, of NFET Read Stack device VT and of data stored in the bitcell.

FIG. 23 is an illustrative representation of a graph depicting that leakage along the Read path in proposed RF arrays can be orders of magnitude lower for array designs with 16 or more bitcells per LBL, according to some embodiments. Leakage along the Read path of proposed RF arrays is independent of the number of bitcells per LBL, independent of device VT of read stack NFETs in the bitcell and also independent of data stored in the bitcell.

DETAILED DESCRIPTION

The end of Dennard scaling and the lack of greater instruction-level parallelism forced the industry to switch from a single energy-intensive core per microprocessor to multiple efficient cores per chip, with roll-outs of the industry's first dual-core processors. The move to parallel processing allowed each core to be more energy efficient by having a lower peak performance (at reduced supply voltage), with multiple cores on the die to increase the overall throughput performance. With Dennard scaling dead since 2004 and Moore's Law slowing to a doubling of transistor count every 20 years, transistors are not getting much faster while the peak power per mm² increases because voltages cannot scale anymore. Power budgets cannot increase either due to heat removal limits (FIG. 1B). Thus, performance limits on CMOS processors have been increasingly imposed by their energy efficiency.

The energy consumption for various arithmetic operations and memory accesses in FIG. 2 shows the relative cost dominated by the energy consumption of data movement (red), which is higher than that of arithmetic operations (blue).

Large last-level caches are included on the CPU chip to scale memory stall time with performance by lowering the miss rate of the processor's caches. Since most of the memory bitcells are idle most of the time, the energy dissipation of large on-chip CPU cache memory is dominated by its leakage. The importance of memory leakage is evident from the fraction of processor power consumed by leakage in large caches, with caches and register files (RF) consuming over 50% of the CPU's energy.

GPUs are widely preferred over CPUs to accelerate AI workloads because Deep Neural Network (DNN) model training is composed of simple matrix math and convolution calculations, the speed of which can be greatly enhanced if the computations can be carried out in parallel. GPUs use tens of thousands of threads to pursue high throughput performance with extreme multithreading. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. In GPUs, the bottleneck for DNN processing is in the memory read access, with each multiply-and-accumulate (MAC) operation requiring three memory read accesses and one memory write access. Row Stationary Dataflows (FIG. 3) that maximize data reuse and local accumulation of data are more energy efficient. FIG. 3 shows the energy consumption by the RF contributing to nearly 70% of the energy of a MAC operation for the more energy efficient row stationary dataflow.

Each thread in a GPU must store its register context on-chip. Unlike CPUs that hide latency of a single thread by using a large last-level on-chip cache, GPUs use a large number of threads and switch between them to hide memory access latency. Just holding the register context of these threads requires substantial on-chip storage. With so many threads, register files are one of the largest on-chip memory resources in current GPUs. Recently announced commercial GPUs report aggregate on-chip RF array sizes up to 256 Mb, much larger than last-level, on-chip caches in CPUs.

Note that while this paper details the circuit schemes proposed for a 2 port 1R1W 8T register file bitcell array, these are easily extended to Register File arrays with additional Read Ports by adding an NFET transistor pair (corresponding to the decoupled Read stack in the ‘Read port’ box in FIG. 9) for each additional Read port i, with the gate input of the lower NFET, NR1i, driven by the cell storage node and the gate input of the upper NFET in the stack, NR2i, driven by RWLi. The source terminal of NR1i is connected to a harvesting node V2Li for each Read port i.

Similarly, each additional Write port j is added to the schematic in FIG. 9 of the 1R1W bitcell by adding NFET PG devices N3j and N4j that connect the cell storage nodes to an added pair of local Write bitlines, BLj and BLBj, with the added pair of NFET PG devices N3j, N4j driven by WWLj at their gate inputs. The peripheral circuits associated with the local and global BL for each read/write port i/j are identical to those described in FIG. 8.

2. Conventional Two-Port 1R1W Register File Array Circuits

2-Port Register File bitcells (FIGS. 4, 5) provide faster signal development rates on the BL and demonstrate lower VMIN when compared to conventional 6T SRAM bitcells. Primarily used when both Read and Write access to memory are desired in the same cycle for high performance processors, 2P RF cells use fast NFET transistors in the read stack to accomplish higher read current at the decoupled read port of the bitcell. The decoupling of the read stack allows a higher read performance without being required to trade it off for higher read stability margins, as is required in the 6T SRAM cell. The decoupled read stack also allows the Write margin at low voltages to be independently optimized for lower VMIN. The fast NFET stack (NR1, NR2) in FIGS. 4A and 5 driving the decoupled read port in the 2P RF bitcell, typically optimized for performance, is also typically leakier than other bitcell devices.

The conventional RF bitpath assumed serves as a baseline reference relative to which improvements are typically reported by industry and academia alike. All of these recent (within last 4 years) references assume this ‘Domino Read’ full-swing technique as the baseline reference to compare their Register File array implementations with.

2.1 Full-Swing, Short-BL sensing with Logic Gates: Small signal differential sensing, typically used in 6T arrays due to small area overheads and robust operation, is not as attractive for RF arrays because differential sense amps do not track delay scaling in logic circuits and because the small signal development rate on the bitline depends on bitline loading capacitance, which is dominated by local interconnects in each bitcell that do not scale with device geometries. The scaling of transistor dimensions also degrades random mismatch at the sense amplifier input, which translates into larger sense amplifier voltage offsets that the BL signal must overcome as a performance overhead.

Alternative large signal sensing schemes for RF arrays, shown in FIG. 4, use a NAND gate and short bit lines (16/32 bits/BL). In this scheme, static CMOS circuits for sensing, short bitlines and rail-to-rail swings on bitlines eliminate the performance scaling issues seen with differential sensing, while the global bitline at an upper metal level routes sensed data across the height of the array at low resistance. This scheme is widely adopted across industry for RF arrays, enabling them to deliver much higher performance in GPUs and scale it with logic gate technology at a high cost in (GPU) chip size and in switching and leakage power.

2.2 Dynamic Read-Access: Dynamic circuits that precharge output nodes so they evaluate much faster on arrival of the clock edge, with inputs stable during evaluation, are found in practically all fast memory arrays. Precharge of local and global bitlines and their evaluation by bitcells at the arrival edge of the Read WL select transition are an example in 2P RF bitcell arrays. However, these techniques are energy inefficient since all of the charge discarded (from the LBL and the GRBL in FIG. 4) to the reference ground potential during evaluation must be resupplied during the BL precharge phase before the next Read cycle. In a typical RF Array Instance, as many as 256 local BL columns are accessed by a Read WL in an 8 KB instance. However, only a few rows in the word direction are selected during the same cycle (Read WL, Write WL, Precharge and a few other control signals in the Word direction), making the bit path in an RF array, from bitcell to output latch, the dominant (>95%) energy consumption component in an RF array.
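The energy penalty of discarding bitline charge can be illustrated with a minimal sketch. It assumes an ideal bitline capacitance that is restored from the supply, so that the energy drawn from VDD to undo a voltage swing ΔV is C·VDD·ΔV. The function name, the 0.8 V supply, the 10 fF capacitance and the 0.25 V reduced swing are illustrative assumptions and are not values taken from this disclosure.

```python
def precharge_energy_fj(c_ff: float, vdd: float, swing: float) -> float:
    """Energy in fJ drawn from the supply to restore a bitline.

    c_ff  : bitline capacitance in fF (assumed ideal and lumped)
    vdd   : supply voltage in V
    swing : voltage lost by the bitline during evaluation, in V
    """
    if c_ff < 0 or vdd <= 0 or not (0.0 <= swing <= vdd):
        raise ValueError("expected c_ff >= 0, vdd > 0 and 0 <= swing <= vdd")
    # Charge resupplied is C * swing; it is delivered from a source at VDD.
    return c_ff * vdd * swing


# Hypothetical example values (not from the disclosure).
C_BITLINE_FF = 10.0
VDD = 0.8

full = precharge_energy_fj(C_BITLINE_FF, VDD, VDD)      # rail-to-rail evaluation
partial = precharge_energy_fj(C_BITLINE_FF, VDD, 0.25)  # limited swing

print(f"full swing:    {full:.2f} fJ")
print(f"limited swing: {partial:.2f} fJ ({partial / full:.0%} of full swing)")
```

Under these assumptions the supply energy scales linearly with the bitline swing, which is why a scheme that limits the swing also limits the precharge energy.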

2.3 Disturb Current Read Failure avoidance with BL Keeper: The read stack also increases the risk of read failure from disturb current if data at cell node ‘Bit’ in FIG. 4 is a ‘0’ during a concurrent read and write access along the same WL. Because the Write WL half selects the bitcell (write BL pair [BL, BLx] are both precharged to VDD), the cell node ‘Bit’ at ‘0’ typically rises 100-150 mV due to the voltage divider across N2 and N3, partially turning on NFET NR1 in the read stack. At relevant Fast N, high T corners (where there is additional noise by way of VT reductions due to temperature, process and random variations), the local BL (LBL) begins evaluating (with a lower read current) as the gate input of NR1 rises to an effective noise voltage, assumed as 300 mV in the circuit simulations below. This noise level is sufficient for the bitpath to read out the wrong data at slower cycle times, given a sufficiently wide distribution of Read current in the RF bitcell. The industry-wide adopted solution for this read failure mechanism is to add keeper device KP driven by feedback inverter K1 from the local BL, shown in FIG. 4. The impact on signal development time on the LBL at the low T, slow NP corners, where the selected bitcell must fight the keeper KP (already in the saturation region) harder to develop signal, can be as high as a 20+% signal development time degradation.

2.4 An Industry Solution to Disturb Current read failure: One alternative solution to the keeper described above for disturb current read failure has been to use PFETs instead of NFETs for access devices in the RF bitcell driven by the Write WL, using precharged-low Local Write BLs in half-selected bitcells during simultaneous read and write access of the RF bitcell. This eliminates the voltage bump at the gate of the lower NFET NR1 in the Read stack when ‘Bit’ is 0, but Ion of NR1 is degraded by up to 35% due to a drop in the high node storage level at ‘Bit’ when both RWL and WWL in the same row are simultaneously turned on, effectively degrading read current. The RWL voltage is boosted by 15-20% to recover performance when using Write PFET access transistors to eliminate Disturb Current driven Read failure. The power and area overheads in doing so appear significant given the size of bootstrap capacitors required to deliver sufficient charge to the WL. Also, this solution assumes approximately equal drive strengths of NFETs and PFETs due to the introduction of embedded Si/Ge source/drain that enhances hole mobility. Absent this feature in older CMOS platforms, other complications of lowering write margins substantially (and raising write VMIN) could arise when using weaker PFET gates instead of NFETs as access devices driven by the write WL.

2.5 High Leakage through Fast Read Stack: Another negative consequence of the use of the Keeper PFET solution is that when the Bitline is held at VDD by the keeper during active or standby mode, all bitcells attached to a Local Bitline are draining high leakage current from the bitline (due to a drop of almost VDD across the top NFET of the read stack) through an already leaky stack, and some stacks are worse (those whose bitcells have ‘Bit’=1, turning on the lower of the two devices in the Read stack). This leakage path is ‘live’ for practically every LBL across the aggregate RF array in a GPU that is powered on. The presence of a keeper circuit also holds the LBL at VDD following a read access where the Bit read in the column was a ‘0’.

2.6 Reliability of NR1 in read stack: NMOS transistor aging mostly arises from positive bias temperature instability (PBTI), hot carrier injection (HCI), time-dependent dielectric breakdown (TDDB) and electro-migration (EM). In an NFET stack as shown in FIG. 6, with a ‘1’ at terminal ‘B’ (equivalent to the storage node that drives the gate input of transistor NR1 in the decoupled Read stack of the 1R1W bitcell), the transistor N2 (equivalent to transistor NR1 in the 1R1W bitcell) will see the most PBTI stress, with VDD asserted across its gate oxide at its Source and Drain terminals over extended periods. VT shifts of the PD-SRAM bitcell transistors due to PBTI are reported, for stress times up to 100 M secs (3 years), of 10-15 mV, which can add to aging from HCI to degrade read stack current/performance and variability even further. With a full VDD across the gate insulator of NR1 (and along the channel of NR2 due to the Keeper) for extended times, for bitcells storing ‘Bit’=1, the above voltage accelerated aging mechanisms due to high sustained vertical and lateral fields in NR1 and NR2 can lead to PBTI degradation of RF read current and its variability.

New CMOS harvesting circuits are proposed that improve component performance and substantially lower the energy cost of moving data across 2-port/multiport Register File (2P/MP RF) arrays typically implemented in GPU based AI Hardware accelerators. These circuits lower switching energy in local and global bitpaths by over 70% for Read and by over 30% for Write when engaging harvested data to self-limit energy dissipation during a memory access. They also lower bitcell leakage currents along the Read transistor stack pair by over an order of magnitude as a result of self-disabling of current flow by the rising electric potential barrier of harvested charge.

Proposed sensing circuits double signal development circuit speed along local and global bitlines by comparing a decreasing BL voltage to the increasing electric potential of harvested charge as the evaluation energy expended on the local or global bit path is harvested. These improvements in sensing speed reduce by up to 50+% the WL Select to Output Data delay in a conventional RF array. The proposed bit path circuits also engage harvested charge to provide immunity to disturb current noise during concurrent Read and Write access along a WL, eliminating the performance, area and energy overheads of BL keeper circuits used in conventional 2 port RF Memory arrays.

Proposed circuits improve the reliability of Read performance-limiting bitcell devices by lowering voltages across their terminals using harvested charge during most of active and standby periods. Area overheads of proposed circuits are expected to be marginal based on device widths of replacements to conventional peripheral circuits and can be further minimized by sharing of devices and their connections between bit slices of the array. Moreover, proposed circuits do not require any changes to the CMOS platform, to the bitcell or to the array architecture, with much of the flow for design, verification and test of 2P RF Memory arrays expected to remain unchanged, minimizing risk and allowing integration of proposed circuits into existing products with minimal disruption to schedule and cost. Circuit simulations are run on a 16 nm FinFET CMOS technology using ASU parameter decks developed and available in the public domain. Additional data on wiring parasitics and bitcell geometries were obtained from IEDM/ISSCC publications by the foundry. The Array architecture assumed in simulations is mostly identical to that reported by the Foundry except for a few opportunities to improve circuit and wiring delays.

3. Example Array for Circuit Analysis and Comparison

To be able to make quantitative comparisons between proposed circuits and those used by baseline industry standard designs, a simple, common 8 KB RF Array architecture (FIG. 7A) in 16 nm FF CMOS is assumed. 16 nm high performance and low power device parameter decks from ASU are used in HSPICE circuit simulations, with technology parameters for 16 nm CMOS wiring parasitics from IEDM (FIG. 7B) publications by the Foundry. Cell dimensions (FIG. 5) and wiring parasitics of this RF array are used in the design of peripheral circuits to compare metrics of performance and energy efficiency using either proposed or conventional circuits.

The 8 KB array, shown in FIG. 7A, has eight 1 Kbyte ‘blocks’ or ‘segments’, each with pairs of 16 b×128 b subarrays using short 16 b BLs. Local peripheral bitpath circuits are placed between the subarrays in a pair on either side of the block. Local Write and Read decoders and control circuits are placed in between the pairs of subarrays. Global I/O and control for the 8 KB instance are placed at the bottom of this column of 8 Blocks as shown in FIG. 7A. The only change to this array (shown in FIG. 8) in the analysis below is the placement of Global I/O, Control and CLK circuits in the middle instead of the bottom, to limit R of pitch constrained Global BLs. Relevant wire R, C and dimensions are shown in FIG. 8. Lateral and vertical dimensions of the array assume a 20% array efficiency where a 20% overhead in X and Y directions is assumed for peripheral circuits.
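The wire capacitances listed with FIG. 8 follow from a simple product of cell count, cell dimension, an overhead multiplier and the per-length capacitance. The following minimal sketch reproduces that arithmetic. The function and variable names are hypothetical, reading 0.767 um and 0.18 um as per-cell pitches is an interpretation, and the multipliers 1.02, 1.2 and 1.25 are taken as-is from the FIG. 8 description.

```python
# Per-length wiring capacitance used in the FIG. 8 calculations (fF/um).
C_PER_UM_FF = 0.185


def wire_length_um(n_cells: int, pitch_um: float, overhead: float) -> float:
    """Wire length spanning n_cells of the given pitch, scaled by an overhead factor."""
    return n_cells * pitch_um * overhead


def wire_cap_ff(n_cells: int, pitch_um: float, overhead: float,
                c_per_um_ff: float = C_PER_UM_FF) -> float:
    """Wire capacitance in fF: length times capacitance per unit length."""
    return wire_length_um(n_cells, pitch_um, overhead) * c_per_um_ff


# (cells spanned, cell dimension in um, overhead multiplier), from the FIG. 8 text.
WIRES = {
    "RWL/WWL": (128, 0.767, 1.02),
    "LBL": (16, 0.18, 1.2),
    "GRBL/GWBL": (224, 0.18, 1.25),
}

lengths = {name: wire_length_um(*params) for name, params in WIRES.items()}
caps = {name: wire_cap_ff(*params) for name, params in WIRES.items()}

for name in WIRES:
    print(f"{name:10s} length = {lengths[name]:7.2f} um, Cw = {caps[name]:6.3f} fF")
```

These are wire-only estimates; device diffusion and gate loading on each line would add to them.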

The ASU decks along with the wiring parasitic data from TSMC reported at IEDM are used in the same array architecture, with the same bitcell assumed in both the baseline reference Register File array and the proposed charge harvesting circuit schemes. This apples-to-apples comparison is what this paper mostly relies on to make quantitative comparisons of performance and power metrics from circuit simulations.

4. Operation of Proposed 2P RF Array Bitpath

4.1 Harvest of LBL & GRBL Evaluation Energy: In the proposed scheme, the Source terminal of the NFET read stack in the RF bitcell, NR1, shown in FIGS. 9, 10, is connected to pin ‘V2L’, a metal line shared by all V2L terminals of bitcells that share the same local BL. V2L has a comparable capacitance and resistance to the local BL.

The Read access proceeds as with a conventional RF bitcell, except that charge flowing into the selected bitcells (with ‘Bit’=1) from the precharged Local BL in any given column is harvested on V2L. This harvesting action raises the voltage on V2L at the same time that LBL loses charge, practically doubling the signal development rate asserted at the gate-source input of the sense-amp (inverter I1 with NFET footer LBR1), until the Read stack self-disables. (Note that the implementation could use a NAND gate instead of inverter I1, with the other input of the NAND driven by a Column select signal if the column is selected by the column multiplexor.) The self-disabling action occurs when the read stack devices of the selected bitcells have insufficient gate overdrive to stay in the linear region and move into the subthreshold region as LBL and V2L converge in voltage (shown by the red and green waveforms in FIG. 8 for local or global bit paths). In this scheme, the logic circuits used deliver the benefit of scaling sensing speed with the CMOS platform without the burden of having to consume the energy of full-swing operation, as conventional RF arrays are required to.

In FIG. 9 a, the GND terminal for a column of 16 bitcells, for the decoupled read stack only, has been replaced with the local harvesting net V2L. The total capacitance of this net is comparable to the total capacitance of the 16-bit-long local BL because the wire length in both cases is the same and because the diffusion capacitance contributions by the S/D terminals of NR1 and NR2 to V2L and to LBL, respectively, are the same.

So, when charge moves from LBL to V2L on selection of any of the bitcells along this column by a Read WL (RWL), the change in voltage at any time (reduction of LBL voltage and increase in V2L voltage) is about the same. This is verified in FIG. 9 b (at bottom), which shows that, to first order, the LBL converges to the same voltage as V2L when the WL is selected.

The capacitance of V2L is fixed and cannot be changed to charge V2L to a different voltage. So, the sensing inverter for the local BL, I1, triggers when LBL and V2L are within a VT of each other, causing its output L_out to make a 0→1 transition, as also seen in FIG. 9 b. For the Global Read BL, the GND terminal of the GRBL evaluation NFET (GBE in FIG. 9 a) is replaced with the harvesting node V2G. Since the GRBL wire capacitance is large (GRBL spans across all blocks), it is advantageous to raise V2G to a higher voltage as it harvests charge from GRBL, so that a smaller voltage swing on GRBL is sufficient to resolve the data. This is accomplished by sizing the length of the net V2G to 40% of GRBL (as shown in FIG. 11) so that a small drop in the GRBL voltage, as GBE evaluates it, swings V2G up by 2.5× the value of this small swing. As can be seen in FIG. 20, the GRBL drops by only 250 mV because of the capacitance divider sharing harvested charge between GRBL and V2G:

From charge conservation, initial charge = final charge. So,

$$C_{GRBL} \cdot V_{DD} = (C_{GRBL} + C_{V2G}) \cdot V2G_{final}$$

Since $C_{V2G} = 0.4\,C_{GRBL}$ (FIG. 10), we get

$$V2G_{final} = \frac{C_{GRBL}}{C_{GRBL} + C_{V2G}} \cdot V_{DD} = \frac{1}{1.4} \cdot 0.85\,\mathrm{V} = 0.61\,\mathrm{V}$$

FIG. 20 shows the GRBL (and V2G) settling to this voltage of 0.61 V after self-disabling GRBL evaluation, saving a substantial amount of energy per column while also driving G_out, the output node of the global sensing inverter I2 in FIG. 9, in less time.
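
The charge-sharing arithmetic above can be checked with a short script. This is only a sketch of the ideal capacitive divider, not a circuit simulation: the function name `shared_voltage` and the normalized capacitance `C_GRBL = 1.0` are illustrative assumptions, while the 0.85 V supply and the 40% sizing of V2G come from the text.

```python
def shared_voltage(c_bitline: float, c_harvest: float, vdd: float) -> float:
    """Common voltage reached when a bitline precharged to vdd shares its
    charge with a harvest node that starts at 0 V (ideal charge conservation)."""
    return c_bitline / (c_bitline + c_harvest) * vdd


VDD = 0.85            # volts, supply value used in the text
C_GRBL = 1.0          # normalized GRBL capacitance (assumed unit, not from the text)
C_V2G = 0.4 * C_GRBL  # V2G sized to 40% of GRBL, per the text

v_final = shared_voltage(C_GRBL, C_V2G, VDD)
grbl_drop = VDD - v_final          # how far GRBL falls from its precharge level
v2g_rise = v_final                 # how far V2G rises from GND
swing_ratio = v2g_rise / grbl_drop # equals C_GRBL / C_V2G for an ideal divider

print(f"V2G_final = {v_final:.2f} V")
print(f"GRBL drop = {grbl_drop * 1000:.0f} mV")
print(f"V2G rise / GRBL drop = {swing_ratio:.1f}")
```

Under these ideal assumptions the GRBL drop comes out slightly under the roughly 250 mV quoted from FIG. 20, and the ratio of V2G rise to GRBL drop reproduces the 2.5× figure.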

4.2 Fast, energy and area efficient Sense amp action: As the LBL voltage drops, the gate input voltage of I1 approaches I1's logic threshold, which itself moves to a higher voltage as V2L rises with more harvested charge. As the LBL voltage meets the rising logic threshold voltage of I1, the output of I1, L_out, rises fast due to the high gain of a CMOS inverter. Since L_out directly drives the gate input of NFET GBE, GBE turns on and the precharged Global Read BL (GRBL) begins discharging as soon as L_out makes its 0→1 transition past the device threshold voltage of NFET GBE.

The precharged GRBL discharges to V2G instead of discharging to GND as in the conventional Global RF bitpath. As with the LBL, the converging voltages on GRBL and V2G trigger a low→high transition at the output of inverter I2. A dropping GRBL voltage meets the rising logic threshold voltage of I2. The converging waveforms of GRBL and V2G (red and green waveforms at bottom of FIG. 9 b) self-disable the NFET GBE.

Note that the V2L net has about the same capacitance and dimensions as the LBL. The LBL and V2L voltages thus converge to about the same value, VDD/2, through this balanced capacitive divider when they share charge. If V2L were to have a smaller capacitance, V2L could rise to a higher voltage and self-limit the LBL to discharging less than half of its charge. Given the impracticality of using a shorter V2L line (it must connect to the GND terminal of the Read stack in each of the RF bit cells along a LBL) and given the smaller capacitance of the LBL (compared to the much larger and longer GRBL), an imbalanced capacitive divider is pursued in the Global BL to raise the voltage of V2G higher than ½ V_(DD), so that V2G can self-limit GRBL discharge sooner, at a voltage closer to V_(DD) than to GND, and can thus consume much less charge from the VDD grid during a Read access.

FIG. 11 shows the V2G line at about 40% of the length of GRBL, requiring the L_out nets in each bit column from the furthest Blocks 0, 1, 6 & 7 to be routed over an additional 6.9 µm (about 1.3 fF). Thus, the Global bitpath circuits (NFET GBE, inverter I2 and reset NFETs GBR1 and GBR2) are placed between blocks 2 & 3 and between blocks 4 & 5. This placement allows V2G to rise to over 70% of VDD, limiting the charge lost by the GRBL (to V2G) on evaluate to less than 30% of what is lost from an equivalent industry-standard RF Global Read BL. Note that the sense amp action is still much faster than the full-swing approach in conventional arrays because the signal development rate seen by I2 is double what would be available from discharge of a Global Read BL in a conventional RF array.
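
The "over 70%" and "less than 30%" figures follow from the capacitance ratio alone if the divider is treated as ideal. The sketch below compares the balanced local divider with the imbalanced global one; the function names and the parameter `cap_ratio` (harvest-node capacitance divided by bitline capacitance) are illustrative and do not appear in the text.

```python
def harvest_rise_fraction(cap_ratio: float) -> float:
    """Final harvest-node voltage as a fraction of VDD.
    cap_ratio = C_harvest / C_bitline; the harvest node starts at GND."""
    return 1.0 / (1.0 + cap_ratio)


def charge_lost_fraction(cap_ratio: float) -> float:
    """Charge leaving the bitline, as a fraction of the charge lost by a
    conventional full-swing discharge to GND (C_bitline * VDD)."""
    return cap_ratio / (1.0 + cap_ratio)


cases = [
    ("balanced divider (LBL / V2L)", 1.0),     # V2L capacitance about equal to LBL
    ("imbalanced divider (GRBL / V2G)", 0.4),  # V2G sized to 40% of GRBL
]
for label, ratio in cases:
    rise = harvest_rise_fraction(ratio) * 100
    lost = charge_lost_fraction(ratio) * 100
    print(f"{label}: harvest node rises to {rise:.1f}% of VDD, "
          f"bitline loses {lost:.1f}% of a full-swing discharge")
```

With a ratio of 1.0 both quantities are 50%, matching the VDD/2 convergence described for the local bitpath; with a ratio of 0.4 the harvest node ends above 70% of VDD and the bitline gives up less than 30% of a full-swing discharge.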

4.3 Reset of Dynamic nodes before Read Access: The Block Select signal from pre-decoders (FIG. 12A or FIG. 12B) triggers a set of 4 interlocked pulses to condition the local and global Read bitpath before the RWL select edge arrives. They condition the bitpath for fast evaluate and also condition the harvesting nodes V2L and V2G to ‘reset’ to GND before the selected bit cells begin evaluating. Charge harvested on V2G for each bit column from a previous Read is first moved to the storage grid V2 by GBR1, whose gate is driven by pulse RST1. RST1 also discharges V2L when it drives the gate input of NFET LBR1. Discharge of V2L has the effect of causing the output of I1 to discharge to GND, which is where V2L is driven by the pulse RST1 at the gate input of LBR1. RST1 is asserted concurrently on the gate input of NFET GBR1 to move harvested charge on V2G to the harvesting grid V2.

Now that L_out is discharged and GBE is turned off, GRBL can be precharged to VDD from its partially discharged state from a previous Read access. Once RST1 has moved charge from V2G to V2, RST2 ‘resets’ V2G to GND, readying it for the impending Read. Also, since L_out has been discharged during RST1, the NFET GBE is turned off, enabling the precharged GRBL to hold its precharge voltage of VDD when V2G is discharged to GND by RST2.

All 4 signal outputs shown in FIG. 12A or FIG. 12B are generated off the Block select signal during a Read access, in the sequence shown, according to when each of the 4 signals is triggered off the Block select path. Systematic variations in Process/Voltage/Temp impact all of these gates in proximity to each other, but design considerations on the pulses from the point of generation to the point of use within the block require sufficient width of the pulse. For example, the Fast-Slow corner for N and P channel FETs respectively at low T could cause the active high pulses (Resets 1, 2) to disappear. Similarly, the Slow-Fast corner for N and P channel FETs respectively at low T impacts the active low pulses (local, global precharge). These and other risks would need to be simulated across all relevant corners to enable robust operation. Random variations in device characteristics are unlikely to be significant, since these circuits will not be using small geometry devices.

4.3 Immunity to Disturb Current Failure: The proposed scheme does not require the keeper circuitry found in conventional RF array bit paths to avoid the read failure seen in a conventional bitpath when RWL and WWL concurrently select the same row of bit cells. This is illustrated in the circuit simulations of a conventional bitpath without keeper circuits. Cell noise at node ‘Bit’, modeled with a voltage bump at the gate of NR1, can initiate an unintended discharge of the LBL, as seen in FIG. 13, when RWL selects the noisy bitcell. FIG. 13 shows a Read failure occurring when the WL pulse is long enough (and/or if the operating T, voltage, process corner or random VT fluctuations in the Read stack increase read current). The NAND output evaluates incorrectly to VDD, causing the Global Read BL in the conventional RF array to discharge when the LBL voltage drops below the logic threshold of the NAND. The ‘keeper’ solution used by conventional RF arrays to avoid the above disturb current failure, however, increases the WL select→G_out delays by over 20%.

When using the proposed bitpath circuits, keepers are not required, since the rising voltage on V2L due to a noise voltage at the gate of NFET NR1 self-disables the discharge of the LBL as V2L asymptotically approaches the noise voltage (FIG. 14). The LBL and GRBL can thus be seen in FIG. 14 maintaining their precharge state of VDD, or close enough to VDD, without evaluating incorrectly as the conventional RF array would in the scenario described above.

4.4 Compact, fast Decoders: FIG. 15 shows a fast, compact alternative to static CMOS gates. Large fan-outs can be driven by decode stages upstream when smaller loads per fanout are being driven. Since the decode stage outputs (from their inverters) are typically active high, each stage evaluates only when the preceding stages evaluate. This eliminates the need for outputs of preceding stages to drive PFETs as well. The CLK·RE or CLK·WE active high signals drive the full CMOS input ‘A’ in the schematic shown in FIG. 15, restricting switching activity to only the path selected by stable address bits (input B, for example, as shown in the 2-input AND gate in FIG. 15).

4.5 Write Data Path: For a Multiply Accumulate operation, 3 Reads and a Write access are typical. Thus, with an 8:1 Write column multiplexer, a Write access exercises a bit column about once for every 24 times a Read access exercises one. FIG. 16 shows the data path for a Write access, with the Global Write BL (GWBL) driving data to be written across the height of the array and an 8:1 column mux driving this data down the selected local WBL pair.
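
The 1:24 figure can be reproduced with a few lines of arithmetic. This sketch rests on one reading of the text that is not stated there explicitly: each Read access is assumed to exercise a given bit column, while a Write access selects only 1 of 8 columns through the multiplexer. The variable names are illustrative.

```python
from fractions import Fraction

reads_per_write = 3   # Reads per Write in a Multiply Accumulate, per the text
write_mux_ratio = 8   # 8:1 Write column multiplexer, per the text

# Assumption: a given bit column is exercised on every Read access,
# but only on 1 out of every `write_mux_ratio` Write accesses.
column_reads_per_write_access = Fraction(reads_per_write, 1)
column_writes_per_write_access = Fraction(1, write_mux_ratio)

reads_per_column_write = column_reads_per_write_access / column_writes_per_write_access
print(f"Read exercises per Write exercise of a column: {reads_per_column_write}")
```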

The GWBL driver schematic in FIG. 16 b shows parts highlighted in blue that have devices with much smaller widths (~⅕ of the driver transistors). The NOR gate in this schematic generates an active high pulse whose leading edge is triggered by a 1→0 transition at the input and whose trailing edge is triggered by a 0→1 transition at the output. The leading edge of this active high pulse turns on NFET N2, which begins charging the output with charge harvested on V2. The leading edge of this pulse is also inverted and delayed to turn on PFET P1, which charges the output from the VDD grid, since NFET N2 can charge the output to no more than the voltage at V2. The rise in output voltage disables the path from V2 to the output through the trailing edge of the active high pulse at the output of the NOR gate. The PFET P1 completes the output charge to VDD. A small geometry PFET keeper, whose gate input is driven by IN and whose drain terminal is connected to OUT, can help avoid any floating nodes. The GWBL driver in a conventional RF array consumes a substantial fraction of the energy expended during a Write access, given the large GWBL capacitance and the large number of GWBL lines being driven (32). The charge-harvesting inverter schematic in FIG. 16 lowers the energy consumed by a conventional inverter for the same purpose by over 30%, as seen in the waveforms in FIG. 16 b. The voltage waveforms from the proposed GWBL driver and an equivalent inverter used in a conventional RF array (with the same GWBL load) are practically identical. Most of the area overhead is from NFET N2 and is not expected to increase the footprint of a conventional inverter by much more than 60-70%.

4.6 Metal Grid that holds Harvest Charge: V2 lines are charged up by Read accesses, as shown at the top of FIG. 18 a, asymptotically approaching the maximum voltage (0.62 V) set by the voltage that V2G is driven to (seen in the waveform of V2G in FIG. 9 as 0.62 V) by the imbalanced capacitive divider between GRBL and V2G in FIG. 11. Since a Write column is expected to be exercised once every 24 times a Read column is exercised, including Write column accesses at this 1:24 frequency distribution between Writes and Reads shows the harvest grid voltage stable around 0.5 V (in FIG. 18 b). This voltage of V2 is sufficient to lower the energy consumed by the GWBL driver by as much as 30%.
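
The asymptotic approach of V2 toward the V2G level can be illustrated with a simple repeated charge-sharing model. This is a sketch under loud assumptions rather than a reproduction of FIG. 18 a: each Read is modeled as an ideal charge share between V2G (taken to sit at 0.62 V after harvesting) and the grid V2, the capacitance ratio `c_ratio` = C_V2G / C_V2 is a hypothetical value chosen only for illustration, and the charge drawn by Write accesses is ignored.

```python
def harvest_grid_voltage(n_reads: int, v_source: float = 0.62,
                         c_ratio: float = 0.05, v_start: float = 0.0) -> list:
    """Voltage on grid V2 after each of n_reads ideal charge-sharing events.

    v_source: voltage on V2G before each share (0.62 V per the text).
    c_ratio:  C_V2G / C_V2, a hypothetical value not given in the text.
    v_start:  initial V2 voltage (assumed fully discharged).
    """
    v = v_start
    history = []
    for _ in range(n_reads):
        # Charge conservation between V2G (at v_source) and V2 (at v)
        v = (v + c_ratio * v_source) / (1.0 + c_ratio)
        history.append(v)
    return history


trace = harvest_grid_voltage(200)
for n in (1, 10, 50, 200):
    print(f"after {n:3d} Reads: V2 = {trace[n - 1]:.3f} V")
```

In this model V2 rises monotonically and never exceeds the V2G level, which is the qualitative behavior described for Read-only operation; the steady value near 0.5 V reported with Writes included would require modeling the charge removed by the GWBL driver, which this sketch does not attempt.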

4.7 Circuit Speed, Switching Energy Comparisons: FIG. 19 a, FIG. 20 a and FIG. 20 b show circuit speed comparisons between conventional RF array peripheral circuits on the one hand and, on the other, proposed circuits that use the same bitcell and proposed circuits that use an RF bitcell with LVT NFETs in the Read stack. The WL select→G_out delay components show improvements of 36+% (same bitcell) and 50+% (bitcell with NR1 and NR2 NFETs as LVT devices). FIG. 19 b, FIG. 20 c and FIG. 21 show the total charge consumed from the power supply by conventional RF array designs and by RF arrays with proposed circuits, with quantitative comparisons organized in Table I.

The improvements in Read performance of the RF bitcell demonstrated in FIG. 20 b, achieved without increasing leakage (as seen in FIG. 23), are realized from the use of LVT transistors in the decoupled Read stack. While using LVT devices in a bitcell is typically not pursued due to substantial increases in leakage, the 2×2 layout of four adjacent 8T cells in FIG. 20 c offers the option to use an LVT mask at no additional cost in area, performance, leakage or additional masks, with the LVT mask extending across the column of bitcells.


TABLE I
Comparison of Performance & Energy consumption of Proposed Circuits in 8 KB RF Array with Conventional Circuits

| Comparison of RF Array Metrics | WL→Data_out Delay | % Improvement | Read Bitpath Energy | % Improvement | Write Bitpath Energy | % Improvement |
|---|---|---|---|---|---|---|
| RF Array with Conventional Circuits | 68.97 ps | - | 17.26 fJ | - | 13.63 fJ | - |
| RF Array with Proposed Circuits | 43.90 ps | 36.3% | 5.09 fJ | 70.5% | 9.69 fJ | 28.9% |
| RF Array with Proposed Circuits using LVT NFETs in Read Stack of Cell | 34.2 ps | 50.4% | 5.09 fJ | 70.5% | 9.69 fJ | 28.9% |

Note: As shown in FIG. 23, there is no change in the leakage current of the array between the bottom 2 rows of the above Table.
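
The percentage columns of Table I can be recomputed from the delay and energy entries. The sketch below uses only the values listed in the table; the dictionary keys and the helper `improvement_pct` are illustrative names.

```python
def improvement_pct(baseline: float, proposed: float) -> float:
    """Percentage reduction of `proposed` relative to `baseline`."""
    return (baseline - proposed) / baseline * 100.0


# Values taken from Table I (delay in ps, energies in fJ)
conventional = {"delay_ps": 68.97, "read_fJ": 17.26, "write_fJ": 13.63}
proposed = {"delay_ps": 43.90, "read_fJ": 5.09, "write_fJ": 9.69}
proposed_lvt = {"delay_ps": 34.2, "read_fJ": 5.09, "write_fJ": 9.69}

results = {}
for name, row in [("proposed", proposed), ("proposed_lvt", proposed_lvt)]:
    results[name] = {
        metric: round(improvement_pct(conventional[metric], value), 1)
        for metric, value in row.items()
    }
    print(name, results[name])
```

The recomputed values agree with the tabulated 36.3%, 50.4%, 70.5% and 28.9% entries to one decimal place.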

4.8 Leakage reduction: FIG. 22 shows the schematics of the leakage paths in the bit cell Read stack in conventional RF arrays and in the RF arrays using proposed circuits. With the proposed circuits, there is no easy outlet for charge to leak away: V2L and V2G collect both evaluation charge and leakage charge, which easily leak away in a conventional RF array. The leakage path using proposed circuits is restricted to the LBR1 NFET footer only. The leakage of LBR1 is independent of the device VTs of the NFET read stack in the RF bitcell (low VT limits are set by the ability to resolve data when excess leakage is present in a column of bitcells) and is also independent of the number of bit cells that share a LBL. The BL and the harvesting node could float up to VDD without consequence to data stored in the 6T part of the RF bitcell. The higher this voltage, the more efficiently charge is harvested by the reset operation directly before a Read access.

1. A Register File memory device comprising: a plurality of conventional 8 transistor 2 port storage elements each with 1 read port and 1 write port and each with a decoupled read stack of a pair of NFETs with the gate input of one driven by a Read word line and the gate input of the other in the pair driven by a cell storage node; a harvest terminal that replaces the reference ground potential terminal of the decoupled read stack of FETs in a conventional Register File storage element; a harvest circuit coupled to the harvest terminal of a plurality of storage elements whose Read ports are coupled along a common bitline with the harvest circuit responsive to a read access by self-disabling the development of signal on the bitline, eliminating the uncertainty of signal voltage development on the bitline due to the statistical variation of read current read stack and at least doubling the rate at which data sensed in the selected storage element is resolved.
2. An apparatus, comprising: a plurality of transistor storage elements, a transistor storage element from the plurality of transistor storage elements including a read port and a write port, the transistor storage element electrically from the plurality of transistor storage elements coupled to a first n-channel field effect transistor (NFET) and a second NFET, the second NFET including a gate terminal configured to be driven by a bitcell including a read word line, the first NFET including a source terminal and a gate terminal such that the gate terminal of the first NFET is configured to be driven by a cell storage node for the read port from the transistor storage element; a bitline electrically coupled to the read port of the transistor storage element from the plurality of transistor storage elements and configured to be precharged; a harvesting node electrically coupled to the source terminal of the first NFET and configured to harvest, at the harvesting node, voltage that was precharged at the bitline in response to a read access action and an activation of the cell storage node; a harvest inverter including a reference ground potential terminal configured to be replaced with the harvesting node, the harvest inverter including a gate terminal configured to be electrically coupled to the read port of the transistor storage element from the plurality of transistor storage elements; and a harvesting grid electrically coupled to the harvesting node, the harvesting grid configured to self-disable a signal development on the bitline to eliminate an uncertainty of the signal development on the bitline when an electric potential of harvested data on the harvesting node matches a voltage at the bitline from the signal development.
3. The apparatus of claim 2, wherein the harvesting node is configured to discharge a voltage to the reference ground potential before the read access through the first NFET having an active high pulse at the gate terminal of the first NFET enabling discharge of the voltage.
4. The apparatus of claim 3, wherein the gate terminal of the harvest inverter is electrically coupled to the bitline, the harvest inverter including an output terminal configured to be triggered by (1) a decreasing electric potential difference between the bitline that is precharged and the harvesting node during the read access when the cell storage node is active and electrically coupled to the first NFET or the second NFET or (2) a voltage substantially equal to a voltage at a power supply terminal that is electrically coupled to the transistor storage element, the output terminal configured to be triggered by movement of electric charge from the gate terminal of the harvest inverter to the harvesting node.
5. The apparatus of claim 2, wherein: the apparatus is configured to perform a global sensing scheme such that the gate terminal of the harvest inverter is electrically coupled to a global bitline that is electrically coupled to a drain terminal of the first NFET and a drain terminal of the second NFET, the gate terminal of the first NFET and the gate terminal of the second NFET being driven by the bitline, the reference ground potential terminal of the harvest inverter configured to be replaced with a global harvest terminal that is electrically coupled to the source terminal of the first NFET, the global harvest terminal configured to harvest charge from the global bitline that is precharged as the first NFET is triggered by a rising output at an output terminal of the harvest inverter.
6. The apparatus of claim 2, wherein the transistor storage element from the plurality of transistor storage elements is configured to be decoupled from the first NFET and the second NFET to enable a higher read performance without compromising read stability margins.
7. The apparatus of claim 2, wherein a discharge of voltage at the bitline is configured to stop in response to a rising voltage at the harvesting node asymptotically approaching a noise voltage at the gate terminal of the first NFET.
8. The apparatus of claim 2, wherein a voltage signal developed at the bitline is determined by a capacitive divide electrically coupled between the bitline and the harvesting node.
9. The apparatus of claim 2, wherein harvested charge stored at the harvesting node is configured to self-disable a flow of read current as the harvested charge at the harvesting node approaches a noise voltage at the bit storage node.
10. The apparatus of claim 2, wherein the signal development on the bitline is configured to self-disable as the electric potential of the harvested data at the harvesting node rises to equalize a dropping voltage of the bitline.
11. The apparatus of claim 2, wherein the harvesting grid is configured to self-disable when the first NFET and the second NFET have insufficient gate overdrive.
12. The apparatus of claim 2, wherein a capacitance of the harvesting node is fixed and cannot be changed to charge the harvesting node to a different voltage.
13. The apparatus of claim 2, wherein a voltage at the harvesting node is configured to increase while the bitline loses charge, to increase a signal development rate for the voltage signal.
14. The apparatus of claim 2, wherein a change in voltage, in response to a charge being transferred from the bitline to the harvesting node when the bitline including the read word line is selected, is the same as when a bitline including a write word line is selected.
15. The apparatus of claim 2, further comprising an inverter electrically coupled to the bitline and the harvesting node, the inverter including an input terminal and an output terminal, the output terminal configured to perform a low-to-high transition in response to a voltage at the bitline and a voltage at the harvesting node being within a voltage threshold.
16. The apparatus of claim 2, wherein increasing a voltage at the harvesting node occurs at a same time as and at a same rate of a voltage at the bitline is lowered.